The following methods provide a way to compare models or to assess their individual accuracy. The first two methods are used for classification tasks and the last one for regression.
To measure the quality of the model, after performing cross-validation we can build a matrix that lets us easily visualize how well our model classified the data.
library(kableExtra)
base <- read.csv("../Bases de datos/college-perf.csv")
kable(head(base),"markdown")
SAT | GPA | Projects | Community | Income | Perf | Pred |
---|---|---|---|---|---|---|
1380 | 2.53 | 1 | 0 | 41800 | Low | Low |
1100 | 3.18 | 1 | 5 | 37600 | Low | Low |
1110 | 2.73 | 2 | 10 | 34800 | Medium | Medium |
1180 | 2.49 | 3 | 0 | 24100 | Low | High |
1240 | 2.89 | 3 | 5 | 56000 | Medium | Medium |
1140 | 2.85 | 2 | 0 | 50800 | Low | Low |
tabla <- table(base$Perf, base$Pred, dnn = c("Actual", "Predicted"))
kable(tabla, "markdown")
 | High | Low | Medium |
---|---|---|---|
High | 458 | 35 | 38 |
Low | 98 | 1150 | 84 |
Medium | 170 | 166 | 1801 |
Note: the main diagonal shows the cases that were classified correctly, while the off-diagonal values show the misclassified cases.
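From this matrix we can also compute the overall accuracy: the sum of the diagonal divided by the total number of cases. A minimal sketch, rebuilding the matrix shown above by hand (the document builds it with `table()` instead):

```r
# Rebuild the confusion matrix from the table above
tabla <- matrix(c(458,   35,   38,
                   98, 1150,   84,
                  170,  166, 1801),
                nrow = 3, byrow = TRUE,
                dimnames = list(Actual    = c("High", "Low", "Medium"),
                                Predicted = c("High", "Low", "Medium")))

# Accuracy: hits on the diagonal over all cases
accuracy <- sum(diag(tabla)) / sum(tabla)
accuracy  # 0.85225
```

Here about 85% of the observations fall on the diagonal, i.e. were classified correctly.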
If we want to represent the information above as proportions:
kable(prop.table(tabla),"markdown")
 | High | Low | Medium |
---|---|---|---|
High | 0.1145 | 0.00875 | 0.00950 |
Low | 0.0245 | 0.28750 | 0.02100 |
Medium | 0.0425 | 0.04150 | 0.45025 |
barplot(tabla, col = c(4, 5, 6), legend = TRUE, main = "Confusion matrix as a bar chart")
The ROC curve represents the ability of a binary classifier to distinguish between categories. It is used to assess the quality of a classification: the model with the larger area under the curve has the better prediction rate. It also provides a way to find a cutoff point for classification.
library(ROCR)
library(ggplot2)
base1 <- read.csv("../Bases de datos/roc-example-1.csv")
kable(head(base1),"markdown")
prob | class |
---|---|
0.9917340 | 1 |
0.9768288 | 1 |
0.9763148 | 1 |
0.9601505 | 1 |
0.9351574 | 1 |
0.9335989 | 1 |
pred1 <- prediction(base1$prob, base1$class)
perf1 <- performance(pred1, "tpr", "fpr")
plot(perf1)
lines(par()$usr[1:2], par()$usr[3:4])
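Since the comparison criterion above is the area under the curve, ROCR can also report that area directly. A short sketch, assuming the `pred1` object created above:

```r
# AUC summarizes the ROC curve in a single number between 0.5 (random)
# and 1 (perfect separation); assumes pred1 built with prediction() above
auc <- performance(pred1, "auc")
auc@y.values[[1]]
```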
The following plot gives an intuitive picture of the usefulness of the ROC curve and of its importance for finding an optimal category cutoff point.
r1 <- base1[base1$class == 1, ][1:46, ]
r2 <- base1[base1$class == 0, ]
# Stack both groups so ggplot maps the fill aesthetic properly
densidades <- rbind(data.frame(prob = r1$prob, group = "Sick"),
                    data.frame(prob = r2$prob, group = "Healthy"))
ggplot(densidades, aes(prob, fill = group)) +
  geom_density(alpha = 0.5) +
  labs(x = "Cutoff point") +
  scale_fill_discrete(name = "Category") +
  geom_vline(xintercept = c(0, 0.5, 0.8, 1), color = "red") +
  theme_bw()
prob.corte <- data.frame(cut = perf1@alpha.values[[1]],
fpr = perf1@x.values[[1]],
tpr = perf1@y.values[[1]])
We examine some cutoff points with their true positive and false positive rates.
kable(head(prob.corte),"markdown")
cut | fpr | tpr |
---|---|---|
Inf | 0 | 0.0000000 |
0.9917340 | 0 | 0.0185185 |
0.9768288 | 0 | 0.0370370 |
0.9763148 | 0 | 0.0555556 |
0.9601505 | 0 | 0.0740741 |
0.9351574 | 0 | 0.0925926 |
We choose the cutoffs whose true positive rate is at least 0.8:
kable(head(prob.corte[prob.corte$tpr>=0.8,]),"markdown")
 | cut | fpr | tpr |
---|---|---|---|
55 | 0.4981506 | 0.2173913 | 0.8148148 |
56 | 0.4961696 | 0.2173913 | 0.8333333 |
57 | 0.4784074 | 0.2391304 | 0.8333333 |
58 | 0.4775468 | 0.2608696 | 0.8333333 |
59 | 0.4632342 | 0.2826087 | 0.8333333 |
60 | 0.4522735 | 0.2826087 | 0.8518519 |
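Filtering by a minimum true positive rate is one way to pick a cutoff; another common criterion (not used in the source, added here as an alternative) is Youden's J statistic, which maximizes tpr − fpr. A sketch, assuming the `prob.corte` data frame built above:

```r
# Youden's J = tpr - fpr; the cutoff that maximizes it balances
# sensitivity against the false positive rate
j <- prob.corte$tpr - prob.corte$fpr
prob.corte[which.max(j), ]
```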
This method is useful for comparing regression models; we want this error to be as small as possible.
The formula for the mean squared error (MSE) is:
\[MSE = \frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2\]
mean((y_actual - y_predicted)^2)
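As a worked illustration with made-up numbers (the vectors below are hypothetical, not from the document's data), we can compare two regression models by their MSE:

```r
# Toy data: actual values and the predictions of two hypothetical models
y_actual <- c(3.0, 5.0, 7.0, 9.0)
pred_a   <- c(2.8, 5.1, 6.5, 9.4)  # model A
pred_b   <- c(3.5, 4.0, 8.0, 8.0)  # model B

mse_a <- mean((y_actual - pred_a)^2)  # 0.115
mse_b <- mean((y_actual - pred_b)^2)  # 0.8125

c(A = mse_a, B = mse_b)  # the model with the smaller MSE fits better
```

Model A's predictions deviate less from the actual values, so its MSE is smaller and we would prefer it.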